(Commercial) Automatic Speech Recognition as a Tool in Sociolinguistic Research
As speech datasets used in sociolinguistic research increase in size, laborious and time-intensive manual orthographic transcription is a challenge, limiting the amount of (transcribed) data which can be analysed. In this paper, I discuss the use of (commercial) automatic speech recognition (ASR) as a tool in sociolinguistic research in the context of a case study: the Lothian Diary Project. I describe the kinds of errors produced by two commercial ASR systems for British English within the broader context of algorithmic bias in ASR, and suggest some best practices when working with ASR in sociolinguistic work.
Language variation, automatic speech recognition and algorithmic bias
In this thesis, I situate the impacts of automatic speech recognition systems in relation to sociolinguistic theory (in particular drawing on concepts of language variation, language ideology
and language policy) and contemporary debates in AI ethics (especially regarding algorithmic
bias and fairness). In recent years, automatic speech recognition systems, alongside other
language technologies, have been adopted by a growing number of users and have been embedded in an increasing number of algorithmic systems. This expansion into new application
domains and language varieties can be understood as an expansion into new sociolinguistic
contexts. In this thesis, I am interested in how automatic speech recognition tools interact
with this sociolinguistic context, and how they affect speakers, speech communities and their
language varieties.
Focussing on commercial automatic speech recognition systems for British Englishes, I first
explore the extent and consequences of performance differences of these systems for different user groups depending on their linguistic background. When situating this predictive bias
within the wider sociolinguistic context, it becomes apparent that these systems reproduce and
potentially entrench existing linguistic discrimination and could therefore cause direct and indirect harms to already marginalised speaker groups. To understand the benefits and potentials
of automatic transcription tools, I highlight two case studies: transcribing sociolinguistic data
in English and transcribing personal voice messages in isiXhosa. The central role of the sociolinguistic context in developing these tools is emphasised in this comparison. Design choices,
such as the choice of training data, are particularly consequential because they interact with existing processes of language standardisation. To better understand the impacts of these choices, and
the role of the developers making them, I draw on theory from language policy research
and critical data studies. These conceptual frameworks are intended to help practitioners and
researchers in anticipating and mitigating predictive bias and other potential harms of speech
technologies. Beyond looking at individual choices, I also investigate the discourses about language variation and linguistic diversity deployed in the context of language technologies. These
discourses put forward by researchers, developers and commercial providers not only have a
direct effect on the wider sociolinguistic context, but they also highlight how this context (e.g.,
existing beliefs about language(s)) affects technology development. Finally, I explore ways of
building better automatic speech recognition tools, focussing in particular on well-documented,
naturalistic and diverse benchmark datasets. However, inclusive datasets are not necessarily
a panacea, as they still raise important questions about the nature of linguistic data and language variation (especially in relation to identity), and may not mitigate or prevent all potential
harms of automatic speech recognition systems as embedded in larger algorithmic systems
and sociolinguistic contexts.
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR
English is the most widely spoken language in the world, used daily by
millions of people as a first or second language in many different contexts. As
a result, there are many varieties of English. Despite the great many advances
in English automatic speech recognition (ASR) over the past decades, results
are usually reported based on test datasets which fail to represent the
diversity of English as spoken today around the globe. We present the first
release of The Edinburgh International Accents of English Corpus (EdAcc). This
dataset attempts to better represent the wide diversity of English,
encompassing almost 40 hours of dyadic video call conversations between
friends. Unlike other datasets, EdAcc includes a wide range of first and
second-language varieties of English and a linguistic background profile of
each speaker. Results on the latest public and commercial models show that EdAcc
highlights shortcomings of current English ASR models. The best performing
model, trained on 680 thousand hours of transcribed data, obtains an average of
19.7% word error rate (WER) -- in contrast to the 2.7% WER obtained when
evaluated on US English clean read speech. Across all models, we observe a drop
in performance on Indian, Jamaican, and Nigerian English speakers. Recordings,
linguistic backgrounds, data statement, and evaluation scripts are released on
our website (https://groups.inf.ed.ac.uk/edacc/) under a CC-BY-SA license.
Comment: Accepted to IEEE ICASSP 202
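The word error rate (WER) figures quoted in the abstract above compare a system transcript against a reference transcript. As a rough illustration only, WER is commonly computed as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below assumes whitespace tokenisation and no text normalisation; the function name `word_error_rate` is hypothetical, and this is not the evaluation script released with EdAcc.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing, punctuation stripping or other normalisation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. Published evaluations usually normalise casing and punctuation before scoring, so values from this sketch are not directly comparable with reported numbers.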